NAACL 2025
Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics
Albuquerque | April 29 – May 4, 2025
FEATURED
AI Chatbots Aren’t Experts on Psych Medication Reactions — Yet
If you think you’re having an adverse drug reaction, it’s best to call a human medical professional, at least for the time being.

By Catherine Barzler, Research Communications
Asking artificial intelligence for advice can be tempting. Powered by large language models (LLMs), AI chatbots are available 24/7, are often free to use, and draw on troves of data to answer questions. Now, people with mental health conditions are asking AI for advice when experiencing potential side effects of psychiatric medicines — a decidedly higher-risk situation than asking it to summarize a report.
One question puzzling the AI research community is how AI performs when asked about mental health emergencies. Globally, including in the U.S., there is a significant gap in mental health treatment, with many individuals having limited to no access to mental healthcare. It’s no surprise that people have started turning to AI chatbots with urgent health-related questions.
Now, researchers at the Georgia Institute of Technology have developed a new framework to evaluate how well AI chatbots can detect potential adverse drug reactions in chat conversations, and how closely their advice aligns with human experts. The study was led by Munmun De Choudhury, J.Z. Liang Associate Professor in the School of Interactive Computing, and Mohit Chandra, a third-year computer science Ph.D. student.
“People use AI chatbots for anything and everything,” said Chandra, the study’s first author. “When people have limited access to healthcare providers, they are increasingly likely to turn to AI agents to make sense of what’s happening to them and what they can do to address their problem. We were curious how these tools would fare, given that mental health scenarios can be very subjective and nuanced.”
De Choudhury, Chandra, and their colleagues will introduce their new framework at the 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics, April 29–May 4.
Putting AI to the Test
Going into their research, De Choudhury and Chandra wanted to answer two main questions: First, can AI chatbots accurately detect whether someone is having side effects or adverse reactions to medication? Second, if they can accurately detect these scenarios, can AI agents then recommend good strategies or action plans to mitigate or reduce harm?
The researchers collaborated with a team of psychiatrists and psychiatry students to establish clinically accurate answers from a human perspective and used those to analyze AI responses.
To build their dataset, they went to the internet’s public square, Reddit, where many have gone for years to ask questions about medication and side effects.
They evaluated nine LLMs, including general-purpose models (such as GPT-4o and Llama-3.1) and specialized models trained on medical data. Using the evaluation criteria provided by the psychiatrists, they measured how precisely the LLMs detected adverse reactions and how accurately they categorized the types of adverse reactions caused by psychiatric medications.
Additionally, they prompted LLMs to generate answers to queries posted on Reddit and compared the alignment of LLM answers with those provided by the clinicians over four criteria: (1) emotion and tone expressed, (2) answer readability, (3) proposed harm-reduction strategies, and (4) actionability of the proposed strategies.
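For a concrete sense of what the detection scoring involves, here is a minimal sketch, using made-up binary labels, of how precision and recall against clinician judgments can be computed. It is illustrative only and is not the study's actual evaluation code.

```python
# Illustrative only (not the study's code): scoring an LLM's adverse-drug-reaction
# (ADR) detection against clinician labels, using hypothetical data.
from sklearn.metrics import precision_score, recall_score

# 1 = the post describes an ADR, 0 = it does not.
clinician_labels = [1, 0, 1, 1, 0, 1, 0, 0]   # gold labels from psychiatrists
llm_predictions  = [1, 0, 0, 1, 1, 1, 0, 0]   # what the model flagged

print("precision:", precision_score(clinician_labels, llm_predictions))
print("recall:   ", recall_score(clinician_labels, llm_predictions))
```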
The research team found that the LLMs stumbled in comprehending the nuances of an adverse drug reaction and in distinguishing between different types of side effects. They also discovered that while the LLMs matched human psychiatrists in tone and emotion, coming across as helpful and polite, they struggled to provide accurate, actionable advice that aligned with the experts'.
Better Bots, Better Outcomes
The team’s findings could help AI developers build safer, more effective chatbots. Chandra’s ultimate goals are to inform policymakers of the importance of accurate chatbots and help researchers and developers improve LLMs by making their advice more actionable and personalized.
Chandra notes that improving AI for psychiatric and mental health concerns would be particularly life-changing for communities that lack access to mental healthcare.
“When you look at populations with little or no access to mental healthcare, these models are incredible tools for people to use in their daily lives,” Chandra said. “They are always available, they can explain complex things in your native language, and they become a great option to go to for your queries.
“When the AI gives you incorrect information by mistake, it could have serious implications on real life,” Chandra added. “Studies like this are important, because they help reveal the shortcomings of LLMs and identify where we can improve.”
Funding: National Science Foundation (NSF), American Foundation for Suicide Prevention (AFSP), Microsoft Accelerate Foundation Models Research grant program. The findings, interpretations, and conclusions of this paper are those of the authors and do not represent the official views of NSF, AFSP, or Microsoft.


Our research shows even the most sophisticated AI models fall short where it counts most — in embodying, as people do, the lived experience of suffering, healing, and care that underpins human connection in mental health.
Munmun De Choudhury
J. Z. Liang Associate Professor, School of Interactive Computing
Georgia Tech


Georgia Tech at NAACL 2025
By the Numbers
Partner Organizations
Amazon • Arizona State University • Birla Institute of Technology and Science • Brown University • Columbia University • Cornell University • Dartmouth College • Dhirubhai Ambani Institute Of Information and Communication Technology • Essential AI • Facebook • Georgia Tech • Google DeepMind • Hofstra University • Intel Labs • Korea Advanced Institute of Science & Technology • Meta • Michigan State University • Microsoft • Northwell Health • Pennsylvania State University • Purdue University • Quantexa • Stanford University • Texas A&M University – College Station • Universidad Complutense de Madrid • University of Arizona • University of California, Berkeley • University of California, Los Angeles • University of California, San Diego • University of Illinois at Urbana-Champaign • University of Massachusetts at Amherst • University of Michigan – Ann Arbor • University of Toronto • Zhejiang University
The Big Picture 

Welcome to NAACL 2025: Advancing the Frontiers of Natural Language Processing
I am delighted to welcome you to NAACL 2025, the largest and most dynamic gathering in the history of NAACL. As a premier conference on Natural Language Processing, NAACL brings together researchers and practitioners at the forefront of language technologies to share cutting-edge work, exchange ideas, and shape the future of our field.
This year marks a major milestone, with a record 3,185 paper submissions and 719 outstanding papers accepted to the main conference—highlighting both the depth and rapid growth of NLP research.
We are at a transformative moment, driven by the rise of large language models, which are redefining how we understand and generate human language at scale. NAACL 2025 offers a unique opportunity to explore breakthroughs in model development, multilingual and multicultural NLP, emergent reasoning capabilities, and responsible AI—advances that are reshaping the way language technologies interact with the world.
Our keynote speakers reflect the breadth and impact of the field: Mike Lewis, pre-training lead for Meta’s Llama 3 models and a recent speaker in Georgia Tech’s Machine Learning Seminar Series, will share insights into scaling and pre-training state-of-the-art language models; Rada Mihalcea, a leader in cross-cultural NLP, will explore how diversity drives better systems and research; and Josh Tenenbaum, from MIT, will connect human cognition with the future of machine intelligence.
This year’s special theme, “NLP in a Multicultural World,” highlights the growing importance of developing language technologies that reflect and support cultural and linguistic diversity. Researchers at Georgia Tech, among others, have been active in advancing this area, helping to uncover how large language models can better serve people from a wide range of backgrounds.
Join us in celebrating the remarkable progress of our field, connecting with colleagues from around the world, and pushing the boundaries of what’s possible in NLP. Welcome to NAACL 2025!

Alan Ritter
Program Co-Chair, NAACL 2025;
School of Interactive Computing, Georgia Tech



Large Language Models continue to redefine computing capabilities in 2025. Hear from a few of Tech’s experts at NAACL as they talk about their contributions to this revolutionary technology. The questions they pose—touching on culture, economics, moral norms, and more—signify an age where computing technology is poised to touch every corner of society.


Research Finds Language Models Align Unevenly with Human Social and Moral Norms
We prompted 11 language models (GPT-4o, Gemini, Llama-3, Arctic, and others) with 400 rules of thumb (RoTs) from the Social Chemistry 101 dataset.
These RoTs represent everyday social and moral norms, such as:
- “It’s good to work at home.”
- “It is good to be patient.”
- “It is ok to live with a roommate of the opposite sex if you are just friends.”
Each RoT was previously labeled by 50 U.S.-based annotators (from a pool of 100 total), spanning a range of age, gender, and income groups.
We compare model responses to these human norms using ADA-Met, a simple ordinal metric that measures how far a model’s response diverges from the modal human norm across demographic groups.
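As a rough illustration of how such an ordinal metric can work, the sketch below computes the mean absolute distance between a model's answer and the modal human answer within each demographic group; the rating scale, data, and formula here are assumptions for illustration and are not the exact ADA-Met definition.

```python
# Minimal sketch of an ordinal divergence measure in the spirit of ADA-Met
# (illustrative approximation; the paper's exact metric may differ).
from statistics import mode

# Ordinal agreement scale shared by humans and models (assumed for illustration).
SCALE = {"strongly disagree": 0, "disagree": 1, "neutral": 2,
         "agree": 3, "strongly agree": 4}

def ordinal_divergence(model_answer, human_answers_by_group):
    """Mean absolute distance between the model's ordinal answer and the
    modal human answer, averaged across demographic groups."""
    m = SCALE[model_answer]
    distances = []
    for group, answers in human_answers_by_group.items():
        modal = SCALE[mode(answers)]      # most common human rating in this group
        distances.append(abs(m - modal))
    return sum(distances) / len(distances)

# Hypothetical annotations for one RoT ("It is good to be patient.")
human_answers = {
    "18-29": ["agree", "strongly agree", "agree"],
    "30-49": ["strongly agree", "strongly agree", "agree"],
    "50+":   ["agree", "agree", "neutral"],
}
print(ordinal_divergence("strongly agree", human_answers))  # lower = closer alignment
```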
Recognizing that LLMs are increasingly used for subjective judgments, we emphasize the importance of knowing whose opinions these models reflect. Social and moral norms, which vary across cultures and societies, are central to these judgments. Our study revealed LLMs don’t capture a broad range of human perspectives, risk reinforcing stereotypes, and can contribute to unequal treatment.
Key findings:
- Most models align more closely with younger, higher-income, unmarried individuals
- Some models refused to answer sensitive RoTs, limiting normative coverage
- Prompt structure (e.g., using markdown tables) improves alignment
Huge thanks to my co-authors Agam Shah, Dipanwita Guhathakurta, Poojitha Nandigam, Sudheer Chava, and the Georgia Tech Financial Services Innovation Lab.

Michael Galarnyk
ML Ph.D. student at Georgia Tech


Sudheer Chava
Alton M. Costley Chair Professor • Finance


Study Reveals Why Language Models Struggle with Arab Culture
Our study investigated why language models (LMs) often show a bias towards Western culture when working with Arabic, a non-Western language. We explored several factors, including the data that LMs are trained on and the linguistic differences between languages. To aid in this, we created a new dataset called CAMeL-2, which contains over 58,000 entities (names, places, etc.) from both Arab and Western cultures with examples in both Arabic and English.
CAMeL-2 was used to test various LMs on tasks like answering questions and identifying entities in both languages. By comparing the models’ performance in English versus Arabic, we aimed to pinpoint the sources of the cultural bias.
Key findings:
- LMs performed better at understanding Arab cultural information when tested in English compared to Arabic.
- LMs struggled in Arabic with high-frequency Arab entities that have multiple meanings (polysemy). For example, a word may name a food while also carrying an unrelated meaning. This issue was less common with Western entities transliterated into Arabic.
- When Arab entities had similar spellings to common words in other languages that use the Arabic script (like Farsi or Urdu), LMs also had more difficulty recognizing them in Arabic.
- The way words are broken down into smaller units (tokens) affected performance. LMs struggled with Arab entities that were tokenized into a single unit, especially if that unit was a polysemous word in Arabic. This problem worsened for models with larger Arabic vocabularies.
The study suggests that cultural bias in LMs isn’t just about the amount of Western data. The unique features of the Arabic language, like words having multiple meanings and similarities to other script-sharing languages, along with how these languages are processed by LMs, also play a significant role. We believe this understanding is crucial for building more equitable multilingual LMs.
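To make the tokenization effect described above concrete, here is a minimal sketch that counts how many subword tokens a multilingual tokenizer assigns to entity names written in Arabic. The tokenizer and example strings are assumptions chosen for illustration; they are not drawn from CAMeL-2, and the study evaluated a range of LMs rather than this single tokenizer.

```python
# Illustrative check of how a multilingual tokenizer splits Arabic entity names.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

# Hypothetical entities: an Arab person name that is also a common polysemous
# word ("Karim" / "generous") and a Western name transliterated into Arabic.
for entity in ["كريم", "مايكل"]:
    tokens = tokenizer.tokenize(entity)
    print(entity, "->", tokens, f"({len(tokens)} token(s))")

# Entities that collapse into a single token that doubles as a common
# polysemous Arabic word were the hardest cases reported in the study.
```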


Tarek Naous
ML Ph.D. student at Georgia Tech


Wei Xu
Associate Professor • Interactive Computing

UNLEARN Forgets Knowledge to Preserve Data Privacy, Shows Big Gains
This research tackles the growing need to efficiently remove specific information from large language models (LLMs) without having to retrain the entire model. The paper introduces a new technique called UNLEARN that can selectively forget knowledge, which is increasingly important due to data privacy regulations like the ‘Right to be Forgotten’ laws. Traditional methods for removing knowledge are often inefficient and can negatively impact other knowledge the model has learned.
The UNLEARN method works by first identifying the specific area (subspace) within the LLM’s internal workings that is responsible for the knowledge to be forgotten. It then uses a process called subspace discrimination to separate this targeted knowledge from similar knowledge, ensuring that removing one doesn’t harm the other. UNLEARN can then remove the identified subspace, effectively making the model forget the targeted information.
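As a conceptual sketch of what removing a knowledge subspace can look like (this is not the authors' exact UNLEARN procedure, and it omits the subspace-discrimination step), one can estimate a low-rank basis from activations tied to the target knowledge and project it out of a weight matrix:

```python
# Conceptual sketch only: subspace identification and removal with hypothetical data.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_examples, rank = 64, 200, 4

# Hypothetical hidden activations collected while the model processes the
# knowledge to be forgotten (shape: examples x hidden dimension).
H_forget = rng.normal(size=(n_examples, d_model))

# Identify the top-`rank` directions spanning the "forget" subspace via SVD.
_, _, Vt = np.linalg.svd(H_forget, full_matrices=False)
U = Vt[:rank].T                      # (d_model x rank) orthonormal basis

# Projection that removes any component lying in that subspace.
P_remove = np.eye(d_model) - U @ U.T

# Apply to a hypothetical weight matrix that reads from the hidden state:
# inputs along the forget subspace no longer influence its output.
W = rng.normal(size=(d_model, d_model))
W_unlearned = W @ P_remove
```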
Key Findings:
- UNLEARN achieved a high forgetting rate of 96% on targeted knowledge while maintaining performance on dissimilar tasks within 2.5% of the original model’s performance. This demonstrates a significant improvement over previous methods in selectively removing information.
- When dealing with similar tasks, UNLEARN achieved nearly 80% forgetting on the targeted task while preserving performance on similar tasks within 10%. This highlights UNLEARN’s ability to discriminate between closely related knowledge, a challenge for existing unlearning techniques.
- The study also introduced LEARN, a dual method to UNLEARN, which can add new knowledge to an LLM and match the fine-tuning accuracy of LoRA without negatively affecting other tasks. This showcases the versatility of the underlying approach for both knowledge removal and addition.
UNLEARN represents a significant advancement in the ability to efficiently and precisely remove knowledge from LLMs without requiring access to the training data and without causing unwanted side effects on other learned information. This has important implications for data privacy, security, and the efficient adaptation of large language models.

Tyler Lizzo
ECE Ph.D. student at Georgia Tech


Larry Heck
Professor • Electrical and Computer Engineering & Interactive Computing

VL‑Time Method Enables Language Models to Reason about Time Series Data
Our research team explored the capability of large language models (LLMs) to reason about time-series data, which are sequences of values recorded over time and are common in many real-world applications. We found that LLMs often struggle with this type of reasoning when the data is presented as raw numbers. To better understand these limitations, we created TimerBed, a new and comprehensive testbed for evaluating how well LLMs can handle time-series reasoning. TimerBed includes different types of reasoning tasks based on real-world data, uses various advanced LLMs and reasoning strategies, and provides benchmarks for comparison.
Our work revealed that LLMs generally perform poorly in time‑series reasoning when directly given numerical data, often performing no better than random guessing. This failure might be because it’s difficult for LLMs to extract important features and handle the long sequences of numbers typically found in time‑series data. To address this issue, we developed a new prompt‑based method called VL‑Time. Instead of feeding numbers directly, VL‑Time uses visualizations (like graphs) of the time‑series data, combined with language‑based instructions to guide the LLMs’ reasoning process.
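A minimal sketch of the VL-Time idea, under the assumption that the multimodal model call itself is handled elsewhere: render the series as a plot, encode the image, and pair it with a language instruction. The data and prompt wording below are illustrative, not the paper's exact setup.

```python
# Minimal sketch: visualize a time series and pair the image with a language
# instruction for a multimodal LLM (the model call is left abstract).
import base64
import io

import matplotlib.pyplot as plt
import numpy as np

# Hypothetical time series (e.g., a weekly-seasonal signal with noise).
t = np.arange(200)
series = np.sin(2 * np.pi * t / 7) + 0.1 * np.random.randn(200)

# 1. Visualize instead of feeding raw numbers.
fig, ax = plt.subplots(figsize=(6, 2.5))
ax.plot(t, series)
ax.set_xlabel("time step")
ax.set_ylabel("value")
buf = io.BytesIO()
fig.savefig(buf, format="png", bbox_inches="tight")
image_b64 = base64.b64encode(buf.getvalue()).decode()

# 2. Pair the image with a language-based reasoning instruction.
instruction = (
    "Look at the plotted time series. Does it show a repeating seasonal "
    "pattern? If so, estimate the period and explain your reasoning step by step."
)

# 3. Send `image_b64` + `instruction` to a multimodal LLM of your choice
#    (e.g., via its chat/vision API) and parse the returned answer.
```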
Key findings:
- VL-Time significantly improves the ability of multimodal LLMs to reason about time series. It achieved an average performance improvement of 140% and up to 433% compared to using numerical data directly.
- It enables multimodal LLMs to perform non-trivial zero-shot reasoning on time series, meaning they can reason about new tasks without prior examples. This is a notable improvement from the near-random performance observed when using numerical data in a zero-shot setting.
- VL-Time makes LLMs powerful few-shot reasoners for time series. With just a few examples, VL-Time allowed LLMs to outperform all tested supervised time-series models on tasks involving simple and complex deterministic reasoning.

Haoxin Liu
CS Ph.D. student at Georgia Tech


B. Aditya Prakash
Associate Professor • Computational Science and Engineering
RESEARCH 
Main
Computational Social Science and Cultural Analytics
Communication Makes Perfect: Persuasion Dataset Construction via Multi-LLM Communication
Weicheng Ma, Hefan Zhang, Ivory Yang, Shiyu Ji, Joice Chen, Farnoosh Hashemi, Shubham Mohole, Ethan Gearey, Michael Macy, Saeed Hassanpour, Soroush Vosoughi
Generation
Do RAG Systems Cover What Matters? Evaluating and Optimizing Responses with Sub-Question Coverage
Kaige Xie, Philippe Laban, Prafulla Kumar Choubey, Caiming Xiong, Chien-Sheng Wu
Human-Centered NLP
Lived Experience Not Found: LLMs Struggle to Align with Experts on Addressing Adverse Drug Reactions from Psychiatric Medication Use
Mohit Chandra, Siddharth Sriraman, Gaurav Verma, Harneet Singh Khanuja, Jose Suarez Campayo, Zihang Li, Michael L. Birnbaum, Munmun De Choudhury
Sociodemographic Prompting is Not Yet an Effective Approach for Simulating Subjective Judgments with LLMs
Huaman Sun, Jiaxin Pei, Minje Choi, David Jurgens
Information Extraction
GLiREL – Generalist Model for Zero-Shot Relation Extraction
Jack Boylan, Chris Hokamp, Demian Gholipour Ghalandari
Language Modeling
Hephaestus: Improving Fundamental Agent Capabilities of Large Language Models through Continual Pre-Training
Yuchen Zhuang, Jingfeng Yang, Haoming Jiang, Xin Liu, Kewei Cheng, Sanket Lokegaonkar, Yifan Gao, Qing Ping, Tianyi Liu, Binxuan Huang, Zheng Li, Zhengyang Wang, Pei Chen, Ruijie Wang, Rongzhi Zhang, Nasser Zalmout, Priyanka Nigam, Bing Yin, Chao Zhang
Self-Generated Critiques Boost Reward Modeling for Language Models
Yue Yu, Zhengxing Chen, Aston Zhang, Liang Tan, Chenguang Zhu, Richard Yuanzhe Pang, Yundi Qian, Xuewei Wang, Suchin Gururangan, Chao Zhang, Melanie Kambadur, Dhruv Mahajan, Rui Hou
Low-resource Methods for NLP
Babysit A Language Model From Scratch: Interactive Language Learning by Trials and Demonstrations
Ziqiao Ma, Zekun Wang, Joyce Chai
NLP Applications
A Picture is Worth A Thousand Numbers: Enabling LLMs Reason about Time Series via Visualization
Haoxin Liu, Chenghao Liu, B. Aditya Prakash
Design2Code: Benchmarking Multimodal Code Generation for Automated Front-End Engineering
Chenglei Si, Yanzhe Zhang, Ryan Li, Zhengyuan Yang, Ruibo Liu, Diyi Yang
Sketch2Code: Evaluating Vision-Language Models for Interactive Web Design Prototyping
Ryan Li, Yanzhe Zhang, Diyi Yang
Phonology, Morphology, and Word Segmentation
The Impact of Visual Information in Chinese Characters: Evaluating Large Models’ Ability to Recognize and Utilize Radicals
Xiaofeng Wu, Karl Stratos, Wei Xu
Resources and Evaluation
CausalEval: Towards Better Causal Reasoning in Language Models
Longxuan Yu, Delin Chen, Siheng Xiong, Qingyang Wu, Dawei Li, Zhikai Chen, Xiaoze Liu, Liangming Pan
Planetarium: A Rigorous Benchmark for Translating Text to Structured Planning Languages
Max Zuo, Francisco Piedrahita Velez, Xiaochen Li, Michael Littman, Stephen Bach
Special Theme
A Survey of NLP Progress in Sino-Tibetan Low-Resource Languages
Shuheng Liu, Michael Best
Is It Navajo? Accurate Language Detection for Endangered Athabaskan Languages
Ivory Yang, Weicheng Ma, Chunhui Zhang, Soroush Vosoughi
On The Origin of Cultural Biases in Language Models: From Pre-training Data to Linguistic Phenomena
Tarek Naous, Wei Xu
Findings
Dialogue and Interactive Systems
Adapting LLM Agents with Universal Communication Feedback
Kuan Wang, Yadong Lu, Michael Santacroce, Yeyun Gong, Chao Zhang, Yelong Shen
Ethics, Bias, and Fairness
LVLM-Compress-Bench: Benchmarking the Broader Impact of Large Vision-Language Model Compression
Souvik Kundu, Anahita Bhiwandiwalla, Sungduk Yu, Phillip Howard, Tiep Le, Sharath Nittur Sridhar, David Cobbley, Hao Kang, Vasudev Lal
Information Extraction
BioEL: A Comprehensive Python Package for Biomedical Entity Linking
Prasanth Bathala, Christophe Ye, Batuhan Nursal, Shubham Lohiya, David Kartchner, Cassie S. Mitchell
Low-resource Methods for NLP
UNLEARN Efficient Removal of Knowledge in Large Language Models
Tyler Lizzo, Larry Heck
NLP Applications
Do Large Language Models Align with Core Mental Health Counseling Competencies?
Viet Cuong Nguyen, Mohammad Taher, Dongwan Hong, Vinicius Konkolics Possobom, Vibha Thirunellayi Gopalakrishnan, Ekta Raj, Zihang Li, Heather J. Soled, Michael L. Birnbaum, Srijan Kumar, Munmun De Choudhury
From Intentions to Techniques: A Comprehensive Taxonomy and Challenges in Text Watermarking for Large Language Models
Harsh Nishant Lalai, Aashish Anantha Ramakrishnan, Raj Sanjay Shah, Dongwon Lee
Special Theme
How Inclusively do LMs Perceive Social and Moral Norms?
Michael Galarnyk, Agam Shah, Dipanwita Guhathakurta, Poojitha Nandigam, Sudheer Chava


See you in Albuquerque!
Development: College of Computing
Project and Web Lead/Data Graphics: Joshua Preston
Featured News: Catherine Barzler
Photography: Kevin Beasley and Terence Rushin; submitted photos
Data: https://2025.naacl.org/program/accepted_papers/